7 Applied example
Please note that the data set used in this section is fairly large (~70,000 rows) and so the figures below might take some time to load.
This section demonstrates the workflow and features of plotscaper using a large real-world dataset focused on a pressing global issue: mental health.
Mental health is a global concern. In developed nations, mental health disorders are primary contributor to years lived with disability, significantly impacting both the well-being of individuals and the economic productivity of entire countries (Organization 2022). This issue, however, extends beyond developed nations. The global burden of mental health disorders has been steadily rising over the recent decades, a trend which is particularly concerning in low-income countries where access to mental health services is even more limited (Patel et al. 2018).
Having had personal experience with friends and relatives impacted by mental health issues, as well as having majored in psychology during my undergraduate studies, I have had a long-standing interest in the topic. It is clear to me that mental health is a key challenge that the world is facing today, and the first step towards solving it will require clearly identifying key trends and patterns. For this reason, I chose to dedicate this section to an exploration of a large longitudinal mental health data set.
7.0.1 About the data set
The Institute of Health Information and Statistics of the Czech Republic (IHIS, ÚZIS in Czech) is a government agency established by the Czech Ministry of Health. Its primary roles is to collect, process, and report on medical data within the country of Czechia (ÚZIS 2024). Of interest, the institute provides high-quality, open-access medical data, including information about the use and manufacture of medicines, summaries of fiscal and employment records in medical facilities, and various epidemiological data sets.
The Long-term care data set (Soukupová et al. 2023) contains longitudinal information about long-term psychiatric care in Czechia. More specifically, it contains aggregated data on individuals released from psychiatric care facilities between 2010 and 2022. It includes information such the region of the treatment facility, the sex of the patients, age category, diagnosis based on the international ICD-10 classification (World Health Organization 2024a, 2024b), the number of hospitalizations, and the total number of days spent in care by the given subset of patients.
Here’s the data set at a quick glance:
## Rows: 68,115
## Columns: 12
## $ year [3m[38;5;246m<int>[39m[23m 2019, 2016, 2011, 2013, 2019, 2013, 2018, 2017, 2019, 2015, 20…
## $ region_code [3m[38;5;246m<chr>[39m[23m "CZ071", "CZ064", "CZ080", "CZ072", "CZ080", "CZ053", "CZ080",…
## $ region [3m[38;5;246m<chr>[39m[23m "Olomoucký kraj", "Jihomoravský kraj", "Moravskoslezský kraj",…
## $ sex [3m[38;5;246m<chr>[39m[23m "female", "male", "male", "female", "male", "female", "male", …
## $ diagnosis [3m[38;5;246m<chr>[39m[23m "f10", "f2", "f7", "f4 without f42", "f60–f61", "f32–f33", "f3…
## $ reason_for_termination [3m[38;5;246m<chr>[39m[23m "release", "early termination", "early termination", "early te…
## $ age_category [3m[38;5;246m<chr>[39m[23m "40–49", "30–39", "30–39", "40–49", "50–59", "40–49", "60–69",…
## $ stay_category [3m[38;5;246m<chr>[39m[23m "short-term", "medium-term", "short-term", "short-term", "shor…
## $ field [3m[38;5;246m<chr>[39m[23m "psychiatry", "psychiatry", "psychiatry", "psychiatry", "psych…
## $ care_category [3m[38;5;246m<chr>[39m[23m "adult", "adult", "adult", "adult", "adult", "adult", "adult",…
## $ cases [3m[38;5;246m<int>[39m[23m 13, 3, 2, 3, 1, 2, 2, 32, 2, 2, 1, 12, 1, 1, 2, 9, 18, 1, 1, 1…
## $ days [3m[38;5;246m<int>[39m[23m 196, 345, 38, 108, 1, 47, 319, 3813, 120, 256, 151, 574, 49, 2…
The data contains over 68,000 rows, totaling over 410,000 hospitalizations. Each row records the number patients with a particular set of of characteristics released from a treatment facility during a given year, and the number of days the patients spent in treatment in total.
The original dataset used Czech column names and category labels. To make the analysis more easily accessible to non-Czech speakers, I took the liberty of translating most of these to English (excluding the region variable). The translation script is available in the thesis repository, at the following path: ./data/longterm_care_translate.R. Additionally, the data set website contains a JSON schema with a text description of each of the variables (Soukupová et al. 2023). I took the liberty of translating these descriptions as well, and provide them below, in table Table 7.1:
| Translated name | Original name | Description |
|---|---|---|
| year | rok | The year hospitalization was terminated |
| region_code | kraj_kod | Region code based on the NUTS 3 classification |
| region | kraj_nazev | Region where the facility was located |
| sex | pohlavi | Classification of patients’ sex |
| diagnosis | zakladni_diagnoza | The primary diagnosis of the psychiatric disorder based on the ICD-10 classification |
| reason_for_termination | ukonceni | The reason for termination of care |
| age_category | vekova_kategorie | Classification of patients’ age category |
| stay_category | kategorie_delky_hospitalizace | Classification of hospitalization based on length: short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months) |
| field | obor | The field of psychiatric care |
| care_category | kategorie_pece | Classification of care: child or adult |
| cases | pocet_hospitalizaci | The total number of cases/hospitalizations in the given subgroup of patients |
| days | delka_hospitlizace | The total time spent in care, in days (= sum of the number of days all patients in a given subgroup spent in care) |
7.0.2 Interactive exploration
Let’s jump into exploring the data.
7.0.2.1 The relationship between cases and days
The first to note about the data set is that the data has been aggregated, such that each row represents the combined number of releases within a given subset of patients. For example, the first row indicates that, in the year 2019, 13 women aged 40-49 were released from treatment facilities in Olomoucký kraj (region), after receiving short-term care for F10 ICD-10 diagnosis (mental and behavioural disorders due to alcohol use, World Health Organization 2024a) for a sum total of 196 days:
## year region_code region sex diagnosis reason_for_termination age_category
## 1 2019 CZ071 Olomoucký kraj female f10 release 40–49
## stay_category field care_category cases days
## 1 short-term psychiatry adult 13 196
Thus, the two primary continuous variables of interest are cases (the number of patients in a given subgroup released from care) and days (the total number of days the given subgroup of patients spent in care). Intuitively, we should expect a fairly linear relationship between these variables, such that a larger group of patients should spend a greater number of days in care, in total. We can use plotscaper to visualize this relationship via a scatterplot:
library(plotscaper) # Load in the package
create_schema(df) |> # Create a schema that can be modified declaratively
add_scatterplot(c("cases", "days")) |>
render() # Render the figureInterestingly, the data does not exhibit a simple linear relationship. Instead, there seems to be a leaf-shaped pattern, with three distinct, roughly linear “leaflets”, suggesting a pooling effect. Closer inspection of the data reveals that the stay_category variable has three levels, corresponding to short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months) care. Color-coding the cases by these categories indeed confirms that these correspond to the three observed leaflets:
df |>
create_schema() |>
add_scatterplot(c("cases", "days")) |>
add_barplot(c("stay_category", "cases")) |> # y-axis isn't just count but weighted by cases
assign_cases(which(df$stay_category == "short-term"), 1) |> # Mark short-term cases green
assign_cases(which(df$stay_category == "long-term"), 2) |> # Mark long-term cases red
render()Try clicking on the bars in the barplot to confirm there is a fairly minimal overlap between the data points in the three categories.
However, the pooling effect does not fully explain the absence of points between the three leaflets. If the distribution of cases and days within each of the three stay_category levels were uniform, we should expect to see more points within the gaps. This suggests a potential selection process, where patients are less likely to be discharged at durations near the category boundaries. We can confirm this by plotting the average number of days a group of patients spent in care:
df$avg_days <- df$days / df$cases
df |>
create_schema() |>
add_scatterplot(c("cases", "avg_days")) |>
add_barplot(c("stay_category", "cases")) |>
assign_cases(which(df$stay_category == "short-term"), 1) |>
assign_cases(which(df$stay_category == "long-term"), 2) |>
set_scale("scatterplot1", "y", transformation = "log10", default = TRUE) |>
render()Now we can clearly see the gaps between the three different distributions along the y-axis.
Try querying the points near the gaps by pressing the Q key and hovering over them. You will observe that these gaps roughly span the 60-90 day and 150-210 day ranges, corresponding to 2-3 months and 5-7 months, respectively. This is a strong indication of a selection process is at work. Specifically, it seems that patients staying for more than two months are likely to be transitioned to medium-term care and kept around for longer, while those staying for more than five months are similarly likely to be moved to long-term care. There are likely mundane administrative or health-care provider reasons for this, however, it is still an interesting pattern.
7.0.2.2 Number of cases over time
An important question to answer is how has the number of patients and the number of days they spent in care changed over time. We can investigate this by plotting the same scatterplot as we did in the section above, as well as two barplots showing the total number of cases and days within each year. We can also include a histogram of the number of days, for good measure:
schema <- df |>
create_schema() |>
add_scatterplot(c("cases", "days")) |>
add_barplot(c("year", "cases")) |>
add_barplot(c("year", "days")) |>
add_histogram(c("days")) |>
set_parameters("histogram1", width = 20) # Set histogram binwidth to 20 days
schema |> render()From the barplots, we can immediately see an interesting pattern: while the numbers of cases seem to have declined over time, the number of days patients spend in care seem to seem to have stayed fairly constant. This suggest that while fewer patients are being hospitalized, they are spending longer time in care.
We can indeed confirm this interactively. Try clicking on the bars corresponding to the year 2010 and 2022, in either of the two barplots (feel free to mark either of the bars in red or green by holding the 1 or 2 keys and clicking the bars). It is clearly visible that, compared with 2010, in 2022 there were more patients in long term care, and the relationship between the number of cases and the total number of days was steeper.
While the declining number of cases over time might initially appear positive, the constant number days in treatment suggests a more concerning trend. Specifically, given that treatment facility placements are limited, the increasing number of patients staying in care for longer durations may be burdening the healthcare system, reducing its capacity to serve new patients.
We can also scrutinize the trend more closely by zooming into the histogram: